Skip to content

Conversation

@hpopuri2
Copy link

@hpopuri2 hpopuri2 commented Jan 27, 2026


##Add Valkey Distributed Cache for Horizontal Scaling

##Summary

This PR implements distributed caching using Valkey to enable horizontal scaling of Trino Gateway. Multiple gateway instances can now share query metadata through a distributed cache layer, ensuring consistent query routing across all instances.

##Motivation

Currently, Trino Gateway uses local Guava caches that are not shared between instances. In multi-instance deployments, this can lead to:

  • Inconsistent query routing when requests hit different gateway instances
  • Cache misses requiring expensive database lookups
  • Inability to leverage cache across horizontally scaled deployments

This implementation addresses these limitations while maintaining backward compatibility and graceful degradation.

##Architecture

3-Tier Caching Strategy

Request Flow:

  1. L1 Cache (Local Guava) → ~1ms
    - Hit: Return immediately
    - Miss: Check L2
  2. L2 Cache (Valkey Distributed) → ~5ms
    - Hit: Populate L1, return
    - Miss: Check L3
  3. L3 Cache (PostgreSQL Database) → ~50ms
    - Found: Populate L2 + L1, return
    - Not Found: Search backends via HTTP (~200ms)

Cache Keys

Three values are cached for each query:

  • trino:query:backend:{queryId} - Backend URL for query routing
  • trino:query:routing_group:{queryId} - Routing group assignment
  • trino:query:external_url:{queryId} - External URL for query access

All keys use configurable TTL (default 30 minutes / 1800 seconds).

##Implementation Details

Core Components

ValkeyConfiguration (gateway-ha/src/main/java/io/trino/gateway/ha/config/ValkeyConfiguration.java)

  • 9 configurable parameters with sensible defaults
  • Input validation (port range, positive values)
  • Convention over Configuration - only enabled, host, and port required
  • Fixed: cacheTtlSeconds parameter now properly used (was previously hardcoded)

Cache Interface (gateway-ha/src/main/java/io/trino/gateway/ha/cache/Cache.java)

  • Generic caching abstraction: get(), set(), invalidate(), isEnabled()
  • Implementation-agnostic design (name describes contract, not implementation)
  • Enables future alternative implementations (e.g., Redis Cluster, Memcached)
  • Located in dedicated io.trino.gateway.ha.cache package for better organization

ValkeyDistributedCache (gateway-ha/src/main/java/io/trino/gateway/ha/cache/ValkeyDistributedCache.java)

  • Implements Cache interface
  • JedisPool connection pooling with configurable pool size
  • Graceful degradation when disabled or connection fails
  • Configurable TTL management via cacheTtlSeconds parameter

QueryCacheManager (gateway-ha/src/main/java/io/trino/gateway/ha/cache/QueryCacheManager.java) - NEW

  • Encapsulates all query-related cache operations
  • Manages 3 LoadingCache instances (backend, routing group, external URL)
  • Provides clean separation of concerns between routing and caching logic
  • Handles both L1 (in-memory) and L2 (distributed) cache operations
  • Methods:
    • L1 operations: setBackendInL1(), getBackendFromL1(), etc.
    • L2 operations: cacheBackend(), getCachedBackend(), etc.
    • Combined operations: setBackend(), updateAllCaches()

NoopDistributedCache (gateway-ha/src/test/java/io/trino/gateway/ha/cache/NoopDistributedCache.java)

  • No-op implementation for testing without real cache
  • Always returns empty, always disabled

Integration

BaseRoutingManager - Simplified routing logic:

  • Now uses single QueryCacheManager instance instead of managing multiple caches
  • Reduced from ~380 to ~310 lines through better separation of concerns
  • updateQueryIdCache() method caches all 3 values via QueryCacheManager
  • All cache operations delegated to QueryCacheManager
  • findBackendForUnknownQueryId() - L1 → L2 → L3 → HTTP search
  • findRoutingGroupForUnknownQueryId() - L1 → L2 → L3 lookup
  • findExternalUrlForUnknownQueryId() - L1 → L2 → L3 lookup
  • Automatic cache backfilling when found in lower tiers

ProxyRequestHandler - Query submission:

  • Updated recordBackendForQueryId() to call updateQueryIdCache() with all 3 values
  • Ensures all query metadata is cached on first submission

HaGatewayProviderModule - Dependency injection:

  • @provides @singleton Cache method
  • Wires ValkeyConfiguration to ValkeyDistributedCache
  • Passes cacheTtlSeconds from configuration to cache implementation

Configuration

Minimal (Recommended for Getting Started)

valkeyConfiguration:
  enabled: true
  host: localhost
  port: 6379

With Authentication

valkeyConfiguration:
  enabled: true
  host: valkey.internal.prod
  port: 6379
  password: ${VALKEY_PASSWORD}
  database: 0

Advanced (Production Tuning)

valkeyConfiguration:
  enabled: true
  host: valkey.internal.prod
  port: 6379
  password: ${VALKEY_PASSWORD}
  database: 0
  maxTotal: 100              # Max connections in pool
  maxIdle: 50                # Max idle connections
  minIdle: 25                # Min idle connections
  timeoutMs: 5000            # Connection timeout
  cacheTtlSeconds: 3600      # 1 hour TTL for long-running queries

Single Instance (No Changes Required)

valkeyConfiguration:
   enabled: false  # Default - local cache sufficient

##Testing

Unit Tests

TestValkeyConfiguration

  • Default values verification
  • Setter/getter correctness

TestValkeyDistributedCache (2 tests)

  • testDisabledCache() - Verifies disabled cache returns empty
  • testNoopDistributedCache() - Tests noop implementation

Integration Tests

TestValkeyDistributedCacheIntegration (9 comprehensive tests using TestContainers)

  • testValkeyConnectionAndBasicOperations() - Basic get/set/invalidate
  • testUpdateQueryIdCachesAllThreeValues() - Verifies all 3 values cached via updateQueryIdCache()
  • testRoutingGroupL2Caching() - L1 miss → L2 hit for routing_group
  • testExternalUrlL2Caching() - L1 miss → L2 hit for external_url
  • testThreeTierCacheLookupForBackend() - L1 miss → L2 hit scenario
  • testCacheBackfillFromDatabase() - L1 miss → L2 miss → L3 hit → backfills L2
  • testMultipleQueryIdsWithDifferentValues() - Multiple concurrent queries
  • testCacheOverwrite() - Cache update behavior
  • testEmptyStringValues() - Edge case handling

TestRoutingManagerExternalUrlCache (6 tests)

  • Tests external URL caching with mocked QueryHistoryManager
  • Verifies L1/L2 cache coordination
  • Tests cache miss fallback to query history

TestContainers Setup

  • Added createValkeyContainer() to TestcontainersUtils
  • Spins up real PostgreSQL and Valkey containers
  • Tests complete 3-tier caching flow end-to-end

Test Results

  • 194 tests total (routing package), all passing
  • Integration tests verify real Valkey connectivity
  • No regression in existing functionality
  • 0 Checkstyle violations

##Backward Compatibility

✅ Fully backward compatible

  • Disabled by default (enabled: false)
  • No changes required to existing configs
  • Single-instance deployments work exactly as before
  • Existing tests pass without modification

Migration Path

From Single to Multi-Gateway:

  1. Deploy Valkey server
    docker run -d -p 6379:6379 valkey/valkey:latest
  2. Update config.yaml on all gateways
    valkeyConfiguration:
    enabled: true
    host: valkey.internal
    port: 6379
    password: ${VALKEY_PASSWORD}
  3. Rolling restart gateways
  4. Verify cache is working

Check Valkey keys

docker exec valkey valkey-cli KEYS "trino:query:*"

No data migration needed - cache populates automatically.

##Graceful Degradation

When Valkey is unavailable:

  • ✅ Queries continue working (falls back to L1 and L3)
  • ✅ Falls back to database lookups
  • ✅ Logs warnings (not errors)
  • ✅ Auto-recovery when Valkey returns

Dependencies

Added:

  • io.valkey:valkey-java:5.5.0
    • Valkey is a Redis fork with compatible protocol
    • Works with both Valkey and Redis servers
    • Apache 2.0 licensed
    • Modern, actively maintained

###Code Quality Improvements

New Files (8)

Core Implementation:

  • gateway-ha/src/main/java/io/trino/gateway/ha/config/ValkeyConfiguration.java (121 lines)
  • gateway-ha/src/main/java/io/trino/gateway/ha/cache/Cache.java (40 lines)
  • gateway-ha/src/main/java/io/trino/gateway/ha/cache/ValkeyDistributedCache.java (156 lines)
  • gateway-ha/src/main/java/io/trino/gateway/ha/cache/QueryCacheManager.java (184 lines) - NEW

Tests:

  • gateway-ha/src/test/java/io/trino/gateway/ha/config/TestValkeyConfiguration.java (71 lines)
  • gateway-ha/src/test/java/io/trino/gateway/ha/cache/NoopDistributedCache.java (47 lines)
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/TestValkeyDistributedCache.java (44 lines)
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/TestValkeyDistributedCacheIntegration.java (267 lines)

Modified Files (10)

Configuration:

  • gateway-ha/src/main/java/io/trino/gateway/ha/config/HaGatewayConfiguration.java - Added ValkeyConfiguration field

Core:

  • gateway-ha/src/main/java/io/trino/gateway/ha/module/HaGatewayProviderModule.java - Added Cache provider with cacheTtlSeconds
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/BaseRoutingManager.java - Refactored to use QueryCacheManager
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/QueryCountBasedRouter.java - Updated to use Cache interface
  • gateway-ha/src/main/java/io/trino/gateway/ha/router/StochasticRoutingManager.java - Updated to use Cache interface
  • gateway-ha/src/main/java/io/trino/gateway/proxyserver/ProxyRequestHandler.java - Cache all 3 values on query submission

Build:

  • gateway-ha/pom.xml - Added valkey-java dependency

Tests:

  • gateway-ha/src/test/java/io/trino/gateway/ha/util/TestcontainersUtils.java - Added createValkeyContainer()
  • gateway-ha/src/test/java/io/trino/gateway/ha/router/TestRoutingManagerExternalUrlCache.java - Updated to use NoopDistributedCache
  • 6 additional test files updated to use Cache interface and new package structure

Future Enhancements

  • Add cache metrics tracking and exposure via /metrics endpoint
  • Add TLS/SSL support for Valkey connections
  • Support Redis Cluster mode for high availability
  • Implement cache warming on startup
  • Add circuit breaker pattern for cache failures
  • Implement cache eviction strategies beyond TTL

@cla-bot cla-bot bot added the cla-signed label Jan 27, 2026
- Fixed cacheTtlSeconds configuration not being used in ValkeyDistributedCache
- Refactored repetitive distributedCache.isEnabled() checks into helper methods
- Created QueryCacheManager to encapsulate cache management logic
- Moved all cache classes to dedicated io.trino.gateway.ha.cache package
- Renamed DistributedCache interface to Cache for better abstraction

These changes provide better separation of concerns and make the caching
infrastructure more maintainable and reusable across the gateway.
@hpopuri2 hpopuri2 requested a review from kbhatianr January 28, 2026 10:46
@Singleton
public Cache getDistributedCache()
{
ValkeyConfiguration valkeyConfig = configuration.getValkeyConfiguration();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of referencing configuration directly, you should inject in HaGatewayConfiguration


@Provides
@Singleton
public ValkeyConfiguration getValkeyConfiguration()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inject in HaGatewayConfiguration here as well.

private final long cacheTtlSeconds;

public ValkeyDistributedCache(
String host,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of taking in all of these attributes, wouldn't it be beneficial to take in the ValkeyConfiguration object directly?

return queryIdExternalUrlCache.get(queryId);
}

// L2 Cache Operations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does L1 vs L2 matter here? imo, this file shouldn't care about specific keys.

this.queryIdBackendCache = buildCache(this::findBackendForUnknownQueryId);
this.queryIdRoutingGroupCache = buildCache(this::findRoutingGroupForUnknownQueryId);
this.queryIdExternalUrlCache = buildCache(this::findExternalUrlForUnknownQueryId);
this.queryCacheManager = new QueryCacheManager(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The QueryCacheManager should be injected into the constructor, not instantiated within the constructor. For dependency injection purposes.

{
String backend;

// L2: Check Valkey distributed cache if enabled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole purpose of moving to the CacheManager is to not have to individually check L1 vs L2 vs ... you simply do a get and it'll return the value if it exists or not if it does not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

2 participants